Abstract

In a recent paper on 'Estimating Species Trees from Unrooted Gene 
Trees' Liu and Yu observe that the distance matrix on the underlying 
taxon set, which is built up from expected internode distances on gene 
trees under the multispecies coalescent, is tree-like, and that the un- 
derlying additive tree has the same topology as the true species tree. 
Hence they suggest using (observed) average internode distances on
gene trees as input for the neighbor joining algorithm to estimate
the underlying species tree in a statistically consistent way. In this
note we give a rigorous proof of their above-mentioned observation.

1 Introduction 

One of the possible reasons for discordance of a gene tree with an under- 
lying species tree is the phenomenon of incomplete lineage sorting, which 
is described by the multispecies coalescent model. Many authors have ad- 
dressed the problem of reconstructing the underlying species tree from a set 
of discordant gene trees, both from a theoretical perspective (e.g. Maddison
[7], Allman et al. [1], and many others) and from a practical or
algorithmic perspective (see e.g. Ewing et al. [3], Liu et al. [5], Than and
Nakhleh [8], Kreidl [6]). Recently, Liu and Yu have published a paper [4]
in which they propose to estimate the expected number of internodes between
any two taxa on gene trees by averaging over the observed numbers
of internodes (on the observed gene trees), for any pair of taxa. They note
the following:

Theorem 1 (Liu and Yu, [4]). Under the multispecies coalescent model on
any fixed species tree, the expected number of internodes between two taxa on 
gene trees determines a tree-like metric on the taxon set, and the underlying 
tree topology is identical with the topology of the species tree. 

Hence, they conclude, applying the neighbor joining algorithm to the
matrix of average internode distances (obtained from observed gene trees) is
a statistically consistent way to estimate the true species tree. It is the goal
of this note to give a rigorous and detailed proof of Liu and Yu's theorem
above.

We start by collecting, in Section 2, a few well-known facts on tree-like
metrics, as well as an easy reformulation of the four-point condition in terms of
weights of quartets (for the terminology of weights of quartet trees see e.g.
Pachter and Sturmfels [9]). Section 3, finally, contains the precise statement
of Liu and Yu's theorem together with its proof. The proof consists essentially
in checking that the 'weight-version' of the four-point condition from
Section 2 holds for the matrix of expected numbers of internodes.

I would like to thank Liang Liu for his interest in this modest note.

2 Preliminaries on tree-like metrics 

In the following let T be a finite set and let $D : T \times T \to \mathbb{R}_{\ge 0}$ be a metric
on T.

Definition 2. The metric D satisfies the four-point condition if the maxi- 
mum of the three numbers 

D(a,b) + D(c,d), D(a,c) + D(b,d), D(a,d) + D(b,c) (2.1) 

is attained at least twice, for every four-element subset $\{a,b,c,d\} \subset T$.
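
As a small illustration of Definition 2, the following Python sketch checks the four-point condition for a metric stored as a nested dictionary. The helper names (`symmetric`, `four_point_holds`), the tolerance `tol` and the two example matrices are assumptions made for this illustration; they do not come from [4] or [9].

```python
from itertools import combinations


def symmetric(pairs):
    """Build a symmetric nested dict D[x][y] from {(x, y): distance}."""
    D = {}
    for (x, y), dist in pairs.items():
        D.setdefault(x, {})[y] = dist
        D.setdefault(y, {})[x] = dist
    return D


def four_point_holds(D, taxa, tol=1e-9):
    """True if, for every 4-subset, the maximum of the three sums in (2.1)
    is attained at least twice (up to the tolerance tol)."""
    for a, b, c, d in combinations(taxa, 4):
        sums = sorted([D[a][b] + D[c][d],
                       D[a][c] + D[b][d],
                       D[a][d] + D[b][c]])
        if sums[2] - sums[1] > tol:
            return False
    return True


# Path-length metric of the tree ((a,b),(c,d)) with all edge lengths 1.
tree_like = symmetric({("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 3,
                       ("a", "d"): 3, ("b", "c"): 3, ("b", "d"): 3})
# Same matrix with D(a,c) raised to 4: still a metric, but not tree-like.
perturbed = symmetric({("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 4,
                       ("a", "d"): 3, ("b", "c"): 3, ("b", "d"): 3})

print(four_point_holds(tree_like, "abcd"))   # True
print(four_point_holds(perturbed, "abcd"))   # False
```

In the first matrix the three sums are 4, 6, 6, so the maximum is attained twice; in the second they are 4, 7, 6, so it is attained only once.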

For every four-taxon subset $\{a,b,c,d\} \subset T$ we define the weight of the
quartet (ab,cd), according to the exposition by Pachter and Sturmfels [9],
to be the number

$w(ab,cd) := w_D(ab,cd) = D(a,c) + D(a,d) + D(b,c) + D(b,d) - 2D(a,b) - 2D(c,d)$.  (2.2)

In the following we will sometimes consider weights with respect to different
metrics D, which will be indicated by a lower index.

Definition 3. The metric D is said to satisfy the weight condition if the 
minimum of the three numbers 

$w_D(ab,cd)$, $w_D(ac,bd)$, $w_D(ad,bc)$  (2.3)

is attained at least twice, for every four-element subset $\{a,b,c,d\} \subset T$.

Remark 4. (1) The sum of the three numbers in the weight condition is 
always 0. Thus their minimum is strictly negative and their maximum is 
strictly positive as soon as not all three numbers vanish. 

(2) If the metric D is tree-like and the underlying tree $\mathcal{T}$ displays the
quartet (ab,cd), then w(ab,cd) = 4x and w(ac,bd) = w(ad,bc) = -2x,
where x is the distance between the path connecting a and b and the path
connecting c and d.
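
Both parts of Remark 4 can be checked numerically on a single quartet tree. The sketch below is only an illustration: the function names and the chosen edge lengths are hypothetical, and the tree ((a,b),(c,d)) has arbitrary pendant edges and an internal edge of length x.

```python
def weight(D, a, b, c, d):
    """Weight w_D(ab,cd) of the quartet (ab,cd) as in (2.2)."""
    return (D[a][c] + D[a][d] + D[b][c] + D[b][d]
            - 2 * D[a][b] - 2 * D[c][d])


def quartet_metric(pendant, x):
    """Path-length metric of the tree ((a,b),(c,d)) with the given pendant
    edge lengths and an internal edge of length x."""
    side = {"a": 0, "b": 0, "c": 1, "d": 1}
    D = {u: {} for u in side}
    for u in side:
        for v in side:
            if u == v:
                D[u][v] = 0.0
            else:
                across = x if side[u] != side[v] else 0.0
                D[u][v] = pendant[u] + pendant[v] + across
    return D


x = 1.5
D = quartet_metric({"a": 1.0, "b": 2.0, "c": 0.5, "d": 3.0}, x)
weights = [weight(D, "a", "b", "c", "d"),   # quartet (ab,cd)
           weight(D, "a", "c", "b", "d"),   # quartet (ac,bd)
           weight(D, "a", "d", "b", "c")]   # quartet (ad,bc)
print(weights)        # [6.0, -3.0, -3.0], i.e. [4x, -2x, -2x]
print(sum(weights))   # 0.0
```

The pendant edge lengths cancel in each weight, so only the internal edge length x survives.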

Lemma 5. (1) For a metric D the four-point condition and the weight 
condition are equivalent. 

(2) The metric D is tree-like if and only if these conditions hold. 

(3) If D is tree-like and $\mathcal{T}$ is the underlying tree, then $\mathcal{T}$ displays the
quartet (ab,cd) if and only if D(a,b) + D(c,d) is the minimum of the three
numbers in the four-point condition, if and only if w(ab,cd) is the maximum
of the three numbers in the weight condition.

Proof. It is well known that if D satisfies the four point condition then D is 
tree-like, and moreover that the underlying tree displays the quartet (ab, cd) 
if and only if the minimum in the four point condition is attained at D(a, b) + 
D(c,d) (for a proof see e.g. Pachter and Sturmfels [9]). From the remark 
above it follows that if D is tree-like with the underlying tree displaying 
the quartet (ab,cd), then it satisfies the weight condition with maximum 
at w(ab,cd). It remains to check that if D satisfies the weight condition 
with maximum at w(ab,cd), then it satisfies the four-point condition with 
minimum at D(a,b) + D(c,d). Thus assume that w(ab,cd) > w(ac,bd) = 
w(ad,bc). Then 

$0 = w(ac,bd) - w(ad,bc) = 3(D(a,d) + D(b,c)) - 3(D(a,c) + D(b,d))$.

Hence we obtain D(a,d) + D(b,c) = D(a,c) + D(b,d). By plugging this into
the definition of w(ac,bd) we obtain

$0 > w(ac,bd) = D(a,b) + D(c,d) - D(a,c) - D(b,d)$,

whence D(a,b) + D(c,d) < D(a,d) + D(b,c) = D(a,c) + D(b,d), which 
completes the proof. □ 



3 Expected internode distances on gene trees 

We consider a taxon set $T = \{t_1, \dots, t_N\}$ containing N taxa, and a species
tree S on T. Assume that for each taxon $t_i \in T$ we have sampled $n(t_i)$
copies of a given locus, denoted $L_{i,1}, \dots, L_{i,n(t_i)}$, for each $i = 1, \dots, N$. Let
$\mathcal{L} = \bigcup_{i=1}^{N} \{L_{i,1}, \dots, L_{i,n(t_i)}\}$. We follow the convention of denoting the elements
of $\mathcal{L}$ by capital letters, while leaves on the species tree S (i.e. the elements
of T) are denoted by lower case letters.

For each rooted binary tree G on the leaf set $\mathcal{L}$, Liu and Yu define in [4]
the internode distance between leaves I and J to be the number of nodes
which lie on the path between I and J in G (I and J are not counted). This
number, which we denote $I_G(I,J)$, induces a metric on $\mathcal{L}$ and thus a weight
$W_G(IJ,KL) := w_{I_G}(IJ,KL)$ for each four-element subset $\{I,J,K,L\} \subset \mathcal{L}$.
Of course, for any G the metric $I_G$ is tree-like with underlying tree G.
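
To make the definition concrete, here is a minimal Python sketch that computes the internode distance and the induced quartet weight on a small gene tree. The encoding of rooted trees as nested tuples and the names `leaf_positions`, `internode_distance` and `quartet_weight` are assumptions of this illustration, not notation from [4]. The distance is obtained as depth(I) + depth(J) - 2 depth(M) - 1, where M is the most recent common ancestor of I and J and depths are counted in edges from the root.

```python
def leaf_positions(tree, pos=()):
    """Map each leaf label to its position: the tuple of child indices
    followed from the root. Internal nodes are nested tuples."""
    if not isinstance(tree, tuple):
        return {tree: pos}
    out = {}
    for i, child in enumerate(tree):
        out.update(leaf_positions(child, pos + (i,)))
    return out


def internode_distance(tree, x, y):
    """Number of nodes on the path between leaves x and y (x, y excluded)."""
    if x == y:
        return 0
    pos = leaf_positions(tree)
    px, py = pos[x], pos[y]
    common = 0  # depth of the most recent common ancestor of x and y
    for u, v in zip(px, py):
        if u != v:
            break
        common += 1
    return len(px) + len(py) - 2 * common - 1


def quartet_weight(tree, p, q, r, s):
    """W_G(PQ,RS): the weight (2.2) taken with respect to the metric I_G."""
    def dist(x, y):
        return internode_distance(tree, x, y)
    return (dist(p, r) + dist(p, s) + dist(q, r) + dist(q, s)
            - 2 * dist(p, q) - 2 * dist(r, s))


G = ((("A", "B"), "C"), "D")   # caterpillar gene tree, one locus per taxon
print(internode_distance(G, "A", "B"))         # 1
print(internode_distance(G, "A", "D"))         # 3
print(quartet_weight(G, "A", "B", "C", "D"))   # 4
print(quartet_weight(G, "A", "C", "B", "D"))   # -2
```

Since G displays the quartet (AB, CD), its weight is positive and the other two weights are equal and negative, in line with Remark 4.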

Definition 6. A coalescence pattern associated with the species tree S and
the vector of multiplicities $(n(t_1), \dots, n(t_N))$ is a rooted tree G with leaf set
$\mathcal{L}$ together with a map

$f : \mathrm{Nodes}(G) \to \mathrm{Nodes}(S)$

with the following two properties: (1) for every $L_{i,j} \in \mathcal{L}$ we have
$f(L_{i,j}) = t_i \in T$, and (2) for any two nodes $m, n \in \mathrm{Nodes}(G)$, if n is a descendant
of m in G, then f(n) is a descendant of f(m) in S. We denote coalescence
patterns as pairs (G, f) in the sequel.

A coalescence pattern is basically the same as what Degnan and Salter
[2] call a (valid) coalescent history. Each coalescence pattern (G, f) (associated
with S and a multiplicity vector $v = (n(t_i))_i$) occurs with a certain
probability P(G, f) under the multispecies coalescent, which is calculated in
the case n(t) = 1 for all t by Degnan and Salter in loc. cit. This makes the
set of coalescence patterns (for fixed S and v) a probability space, and the
internode distance $I_G(I,J)$ for each gene tree G between two leaves $I, J \in \mathcal{L}$
induces a random variable on this probability space, which we denote by
$\mathbb{I}(I,J)$. By abuse of language we also call this random variable the 'internode
distance' between I and J. Similarly for the weight of a quartet (IJ, KL) for
$I, J, K, L \in \mathcal{L}$: we denote the corresponding random variables by W(IJ, KL),
for each quartet (IJ, KL).

Thus the following numbers, associated with S and v, are well-defined
(here and in the following we suppress the dependence on S and v in the
notation, though we want to stress once more that all this requires choosing
and fixing a multiplicity vector v!):

$D(I,J) = E(\mathbb{I}(I,J)) = \sum_{(G,f)} P(G,f) \cdot I_G(I,J)$,
$E(W(IJ,KL)) = \sum_{(G,f)} P(G,f) \cdot W_G(IJ,KL)$,  (3.1)

the expected internode distance between two leaves $I, J \in \mathcal{L}$, and the expected
weight of a quartet (IJ, KL) under the multispecies coalescent model.

Lemma 7. (1) Since for each G the function $I_G$ is a metric, so is $D = E(\mathbb{I})$.
(2) Hence the weight function $w_D$ is defined and satisfies
$w_D(IJ,KL) = E(W(IJ,KL))$.

Proof. Both claims are immediate consequences of linearity of expected val- 
ues. □ 
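
The linearity argument behind Lemma 7 (2) can be illustrated numerically. In the following sketch the three gene trees and their probabilities are purely hypothetical (they are not computed from any species tree under the multispecies coalescent), trees are encoded as nested tuples, and the map f of a coalescence pattern is ignored; only the identity between the weight of the expected distance and the expected weight is being checked.

```python
from fractions import Fraction


def positions(tree, pos=()):
    """Leaf label -> tuple of child indices leading from the root to it."""
    if not isinstance(tree, tuple):
        return {tree: pos}
    out = {}
    for i, child in enumerate(tree):
        out.update(positions(child, pos + (i,)))
    return out


def internode_metric(tree):
    """I_G as a nested dict: nodes strictly between two distinct leaves."""
    pos = positions(tree)
    I = {x: {} for x in pos}
    for x, px in pos.items():
        for y, py in pos.items():
            common = 0
            while common < min(len(px), len(py)) and px[common] == py[common]:
                common += 1
            I[x][y] = 0 if x == y else len(px) + len(py) - 2 * common - 1
    return I


def weight(D, a, b, c, d):
    """Quartet weight (2.2) with respect to the matrix D."""
    return D[a][c] + D[a][d] + D[b][c] + D[b][d] - 2 * D[a][b] - 2 * D[c][d]


# Hypothetical distribution on three gene trees (an assumption of this sketch).
patterns = [(Fraction(1, 2), ((("A", "B"), "C"), "D")),
            (Fraction(1, 4), ((("A", "C"), "B"), "D")),
            (Fraction(1, 4), ((("B", "C"), "A"), "D"))]

metrics = [(p, internode_metric(G)) for p, G in patterns]
leaves = "ABCD"
# D = E(I): entrywise expectation of the internode distances, as in (3.1).
D = {x: {y: sum(p * I[x][y] for p, I in metrics) for y in leaves} for x in leaves}

quartets = [("A", "B", "C", "D"), ("A", "C", "B", "D"), ("A", "D", "B", "C")]
for q in quartets:
    expected_weight = sum(p * weight(I, *q) for p, I in metrics)
    print(q, weight(D, *q), expected_weight)   # the last two values agree
```

Exact rational arithmetic (`Fraction`) is used so that the two sides can be compared without rounding error.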

Finally, we note that the expression D(I, J) does not really depend on
the leaves I, J, but only on the taxa in T they belong to. Thus we have
defined a metric

$D(i,j) \in \mathbb{R}_{\ge 0}$, for any two taxa $i, j \in T$,

as well as a weight (depending on D)

$w_D(ij,kl) \in \mathbb{R}$, for any four taxa $i, j, k, l \in T$.

Combining Equation (3.1) with Lemma 7 we obtain that we may calculate
the weight $w_D(ij,kl)$ for four taxa $i, j, k, l \in T$ as

$w_D(ij,kl) = \sum_{(G,f)} P(G,f) \cdot W_G(IJ,KL)$,  (3.2)

where (G, f) runs through all possible coalescence patterns, and where $I \in \mathcal{L}$
is any locus corresponding to $i \in T$, J any locus corresponding to $j \in T$, and
so forth.

Theorem 8 (Liu and Yu, [4], Theorem A1). The metric $D = E(\mathbb{I})$ is
tree-like, and the underlying additive tree has the same topology as the true
species tree S.

For the proof we introduce a little piece of notation: For any rooted tree
$\mathcal{T}$ and a finite subset of leaves $l_1, \dots, l_k$ we denote by $M_{\mathcal{T}}(l_1, \dots, l_k)$ the
most recent common ancestor of $l_1, \dots, l_k$ in $\mathcal{T}$.

Proof. By Lemma 5 it suffices to check that, if the species tree S displays
the quartet (ab,cd), then the following holds:

$w_D(ab,cd) > w_D(ac,bd) = w_D(ad,bc)$.

This is relatively easy to check using equation (3.2). We thus assume that
S displays the quartet (ab,cd), and we consider gene lineages A, B, C, D
sampled from the respective taxa. We have to distinguish two cases, namely:
(1) the (rooted) subtree S' of S with leaf set {a, b, c, d} has the shape of a
caterpillar tree, and (2) S' has the balanced shape. In case (1) we assume
without loss of generality that S' has the topology (((a,b),c),d), while in
case (2) S' must have the topology ((a,b),(c,d)).

We now partition the set of coalescence patterns (G, f) into two disjoint
subsets X and Y. In case (1) we define

$X = \{(G,f) \mid f(M_G(A,B)) \text{ is ancestral to } M_S(a,b,c)\}$,
$Y = \{(G,f) \mid (G,f) \notin X\}$.  (3.3)

Note that $(G,f) \in Y$ then means that the lineages A and B coalesce below
the point where the population c merges with the population ancestral to
a and b. In case (2) we set

$X = \{(G,f) \mid f(M_G(A,B)) \text{ and } f(M_G(C,D)) \text{ are both ancestral to } M_S(a,b,c,d)\}$,
$Y = \{(G,f) \mid (G,f) \notin X\}$.  (3.4)

Consider the case of a coalescence pattern $(G_1, f_1) \in X$. Then the lineages
of A, B and C on $G_1$ enter the population above $M_S(a,b,c)$ separately.
Hence, by permuting the lineages A, B and C we obtain coalescence patterns
$(G_2, f_2), (G_3, f_3) \in X$ such that $P(G_1,f_1) = P(G_2,f_2) = P(G_3,f_3)$,
and such that, after possibly renumbering the coalescence patterns, $G_1$
displays (AB, CD), $G_2$ displays (AC, BD) and $G_3$ displays (AD, BC), and
such that

$-2W_{G_1}(AC,BD) = -2W_{G_1}(AD,BC) = W_{G_1}(AB,CD) = x$,
$-2W_{G_2}(AB,CD) = -2W_{G_2}(AD,BC) = W_{G_2}(AC,BD) = x$,  (3.5)
$-2W_{G_3}(AB,CD) = -2W_{G_3}(AC,BD) = W_{G_3}(AD,BC) = x$,

where x is the number of nodes on the path connecting the path between A
and B with the path between C and D in $G_1$.
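
The cancellation encoded in (3.5) can be observed on a toy example. The sketch below takes one caterpillar gene tree, produces two further trees by permuting the labels A, B and C, and tabulates the three quartet weights on each tree. The nested-tuple encoding, the function names and the choice of trees are assumptions made for illustration; the map f and the probabilities are left out, since only the weights matter here.

```python
def positions(tree, pos=()):
    """Leaf label -> tuple of child indices leading from the root to it."""
    if not isinstance(tree, tuple):
        return {tree: pos}
    out = {}
    for i, child in enumerate(tree):
        out.update(positions(child, pos + (i,)))
    return out


def internode(tree, x, y):
    """Number of nodes on the path between distinct leaves x and y."""
    pos = positions(tree)
    px, py = pos[x], pos[y]
    common = 0
    while common < min(len(px), len(py)) and px[common] == py[common]:
        common += 1
    return len(px) + len(py) - 2 * common - 1


def W(tree, p, q, r, s):
    """Quartet weight W_G(PQ,RS) with respect to the internode metric."""
    return (internode(tree, p, r) + internode(tree, p, s)
            + internode(tree, q, r) + internode(tree, q, s)
            - 2 * internode(tree, p, q) - 2 * internode(tree, r, s))


def relabel(tree, mapping):
    """Apply a permutation of leaf labels to a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return mapping.get(tree, tree)
    return tuple(relabel(child, mapping) for child in tree)


G1 = ((("A", "B"), "C"), "D")                # displays (AB, CD)
G2 = relabel(G1, {"B": "C", "C": "B"})       # displays (AC, BD)
G3 = relabel(G1, {"A": "C", "C": "A"})       # displays (AD, BC)

quartets = [("A", "B", "C", "D"), ("A", "C", "B", "D"), ("A", "D", "B", "C")]
table = [[W(G, *q) for q in quartets] for G in (G1, G2, G3)]
print(table)         # [[4, -2, -2], [-2, 4, -2], [-2, -2, 4]]
column_sums = [sum(row[n] for row in table) for n in range(3)]
print(column_sums)   # [0, 0, 0]
```

Each column sums to zero, so three patterns of equal probability contribute nothing to any of the three expected weights.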

On the other hand, if $(G,f) \in Y$, then G necessarily displays the quartet
(AB, CD). Hence for such G we have

$W_G(AB,CD) > 0$, while
$W_G(AC,BD) = W_G(AD,BC) = -\frac{1}{2} W_G(AB,CD)$.  (3.6)

Now recall equation (3.2) and write

$w_D(ij,kl) = \sum_{(G,f)} P(G,f) \cdot W_G(IJ,KL) = \sum_{(G,f) \in X} P(G,f) \cdot W_G(IJ,KL) + \sum_{(G,f) \in Y} P(G,f) \cdot W_G(IJ,KL)$.

From Equation (3.5) we see that in the expressions $w_D(ab,cd)$, $w_D(ac,bd)$
and $w_D(ad,bc)$ the sum over the $(G,f) \in X$ vanishes, and equation (3.6)
further implies that

$\sum_{(G,f) \in Y} P(G,f) \cdot W_G(AB,CD) > 0$, while
$\sum_{(G,f) \in Y} P(G,f) \cdot W_G(AC,BD) = \sum_{(G,f) \in Y} P(G,f) \cdot W_G(AD,BC) = -\frac{1}{2} \sum_{(G,f) \in Y} P(G,f) \cdot W_G(AB,CD)$.  (3.7)

This shows that D satisfies the weight condition, with the maximum attained
for the quartet (ab, cd). Invoking Lemma 5 completes the proof. □